Cov19: Base-final evaluation: troponin = 'all'

This notebook shows the methodical approach for analyzing the COV19 base data (positive vs. negative) with troponin = 'all':

  1. Run 'train_test_split_analysis.py' with repeated randomized train-test splits at different split sizes -> evaluate the optimal fraction for splitting into train and test data.

  2. Select the 'optimal' train-test-split fraction and get the best model from all supported model types (LogisticRegression, KNeighborsClassifier, RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, SVC). Best model = the one reaching the best mean test-accuracy score (on unseen data).

  3. Retrain this model with its best hyperparameters on all data and analyze the results via its feature importance.

1. train_test_split_analysis

In this section the behaviour of train-data size versus test-data size is analyzed. The goal is to find an optimal fraction that yields stable and valid results. An optimal fraction is reached when the variance in the results on train and test data is relatively low. Normally, much train data and little test data leads to a low variance in train results but a high variance in test results, and vice versa. Somewhere in between there should be an optimum. This fraction is used for further analysis!
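The repeated-split analysis described above can be sketched as follows. This is a minimal illustration, not the actual 'train_test_split_analysis.py' script: the synthetic dataset, the candidate test sizes, and the LogisticRegression placeholder are all assumptions.

```python
# Sketch of the repeated randomized train-test-split analysis:
# for each candidate test_size, repeat the split with many random
# seeds and record the spread of the test accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)

results = {}
for test_size in (0.1, 0.2, 0.25, 0.3, 0.4):
    test_scores = []
    for seed in range(20):  # repeated randomized splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        test_scores.append(model.score(X_te, y_te))
    # a low std across seeds indicates a stable split fraction
    results[test_size] = (np.mean(test_scores), np.std(test_scores))

for ts, (m, s) in results.items():
    print(f"test_size={ts}: mean={m:.3f}, std={s:.3f}")
```

In the real analysis the same loop would also record the train accuracy, so both variances can be compared per split fraction.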

The train-test-split analysis generated the next two plots:

These plots show the expected behaviour. With low test sizes (0.1) the variance in test accuracy is high. This variance decreases as the test size is increased. Visually, an optimum seems to lie around a test size of 0.25; at this point the variance of both train and test accuracy is relatively low.

It is also visible that the GradientBoostingClassifier seems to overfit the data, as seen in the large gap between mean train accuracy and mean test accuracy.

2. Getting the best model from all supported model types

In a second step the best model across all supported models and hyperparameters is selected. The best model is the one reaching the best mean test-accuracy score (accuracy on unseen data). After fixing the optimal train-test split size of 0.25, only these results are considered in the next steps of this analysis!

The output of the former analysis was two .csv files with all best models (hyperparameters) per algorithm and train-test split/random_state.

From these tables only test_size=0.25 is taken for further analysis: the best model overall is the model with the best mean test accuracy (at test_size=0.25). Its best hyperparameters are determined by a majority vote.
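The selection and the majority vote can be sketched like this. The candidate list, hyperparameter values, and number of repeats are illustrative placeholders, not the project's actual grid:

```python
# Sketch of the model-selection step: score each candidate algorithm
# on repeated splits at the chosen test_size=0.25 and keep the one
# with the best mean test accuracy.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

mean_scores = {}
for name, model in candidates.items():
    scores = []
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))
    mean_scores[name] = np.mean(scores)

best = max(mean_scores, key=mean_scores.get)
print("best model:", best)

# Majority vote over the per-split best hyperparameters
# (toy values for illustration):
per_split_params = [{"C": 1.0}, {"C": 1.0}, {"C": 0.1}]
voted = Counter(
    tuple(sorted(p.items())) for p in per_split_params
).most_common(1)[0][0]
print(dict(voted))  # -> {'C': 1.0}
```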

3. Model evaluation and Feature Importance

The former chapters said nothing about the model pipeline and its tuning parameters. These are shown in this section. We will analyze the data and its predictions with the best model evaluated in section 2.

EDA

The missing-value plot is already shown in part 1, as is the missing data per subgroup.

Next, the target variable has to be replaced by numeric values in order to calculate the correlation of the features with the target.
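A minimal sketch of this replacement, assuming the target column is named 'target' with string labels 'positive'/'negative' (column and feature names here are assumptions):

```python
import pandas as pd

# toy frame standing in for the COV19 base data
df = pd.DataFrame({
    "troponin": [0.01, 0.40, 0.03, 0.55],
    "age": [45, 70, 50, 65],
    "target": ["negative", "positive", "negative", "positive"],
})

# map the string labels to numeric values ...
df["target"] = df["target"].map({"negative": 0, "positive": 1})

# ... so the feature-target correlation can be computed
corr = df.corr()["target"].drop("target")
print(corr)
```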

Machine Learning

The pipeline for the ML workflow looks as follows:
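A hedged sketch of such a pipeline: imputation, scaling, and a placeholder ('dummy') estimator that is swapped out later. The step names and preprocessing choices are assumptions, not taken from the project code:

```python
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("estimator", DummyClassifier()),  # replaced by the best model later
])
print([name for name, _ in pipe.steps])  # -> ['imputer', 'scaler', 'estimator']
```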

The best hyperparameters as well as the performance statistics have been evaluated in the section above. These are:

The model is built from the pipeline by replacing the dummy estimator with the best estimator and updating the hyperparameters in the pipe. After this, the pipe can be fitted on the full dataset!
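This swap can be sketched with `set_params`; the winning estimator, its hyperparameters, and the data below are placeholders for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("estimator", DummyClassifier()),
])

# replace the dummy estimator and update its hyperparameters in place
pipe.set_params(estimator=LogisticRegression(),
                estimator__C=1.0, estimator__max_iter=1000)
pipe.fit(X, y)  # refit on the full dataset
print(type(pipe.named_steps["estimator"]).__name__)  # -> LogisticRegression
```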

Predictions

The prediction performance (example) can be slightly optimistic because the model was refit on all data (no train/test split). The statistics above, however, show the 'true' (estimated) values.

The confusion matrix as well as the classification report show a well-balanced picture. Precision, recall and f1-score are very close, so the model seems to work quite accurately!
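Both diagnostics come from `sklearn.metrics`; the toy labels below are illustrative, not the notebook's actual predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]  # placeholder ground truth
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]  # placeholder predictions

# rows = true classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred))
# per-class precision, recall and f1-score
print(classification_report(y_true, y_pred))
```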

Feature Importance

This section tries to highlight the most important features in the dataset. Two approaches are chosen:

  1. regression coefficients (only possible for regression models such as LogisticRegression)
  2. feature importance with SequentialFeatureSelection
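The first approach can be sketched by reading the coefficients of a fitted LogisticRegression. The feature names and the synthetic data are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
features = ["troponin", "age", "crp", "ldh"]  # illustrative names

model = LogisticRegression(max_iter=1000).fit(X, y)

# absolute coefficient size ~ importance; the features must be on a
# comparable scale for this ranking to be meaningful
importance = pd.Series(np.abs(model.coef_[0]), index=features)
print(importance.sort_values(ascending=False))
```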

A second approach for the feature importance is the sequential feature selection (SFS).

SFS does not work with a pipeline object; therefore the estimator is removed, the data is preprocessed with the pipeline without the estimator, and this estimator is passed to SFS.

Evaluate all (best) models (Feature Importance)